Kinesis Analytics

  • Querying streams of data continuously
  • Can receive data from and send results to
    • Kinesis Data Streams
    • Kinesis Data Firehose
  • Analysis using SQL
  • Can use a reference table from S3 to enrich analysis
  • Errors are written to a dedicated error stream

Kinesis Analytics Use Cases

  • Streaming ETL
  • Continuous metric generation
  • Responsive analytics

Schema discovery

  • Generates a data schema automatically from sample stream data, as sketched below
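
A minimal sketch of schema discovery through boto3; the stream and role ARNs are placeholders:

  import boto3

  ka = boto3.client('kinesisanalytics')

  # Ask the service to infer a schema from records already flowing
  # through the stream (both ARNs are hypothetical).
  resp = ka.discover_input_schema(
      ResourceARN='arn:aws:kinesis:us-east-1:123456789012:stream/example-stream',
      RoleARN='arn:aws:iam::123456789012:role/example-analytics-role',
      InputStartingPositionConfiguration={'InputStartingPosition': 'NOW'},
  )
  print(resp['InputSchema']['RecordColumns'])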

RANDOM_CUT_FOREST

  • A SQL function offered by Kinesis Data Analytics
  • For anomaly detection
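
A sketch of the in-application SQL, following the pattern from the AWS docs; the stream names and the "value" column are placeholders for your own input schema:

  # In-application SQL for a Kinesis Data Analytics application.
  # RANDOM_CUT_FOREST adds an ANOMALY_SCORE column to each record.
  ANOMALY_SQL = """
  CREATE OR REPLACE STREAM "DEST_SQL_STREAM" ("value" DOUBLE, "ANOMALY_SCORE" DOUBLE);

  CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DEST_SQL_STREAM"
      SELECT STREAM "value", "ANOMALY_SCORE"
      FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM "value" FROM "SOURCE_SQL_STREAM_001")
      ));
  """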

Amazon Elasticsearch Service (ES)

  • Petabyte-scale search, analysis, and reporting tools
  • Fundamentally based on a search engine (Lucene)
  • ES can be regarded as an analysis tool
    • Has a visualization tool (Kibana)
  • Can use data pipelines to send streaming data to ES
    • Kinesis
    • Beats, LogStash, Apache Kafka
    • ES API
  • Horizontally scalable

ES Use Cases

  • Text search
  • Log analytics
  • Application monitoring
  • Security analytics
  • Click stream analytics

ES Concepts

  • Documents
    • Text or JSON
  • Types
    • Define the schema and mapping shared by documents
  • Indices
    • Power search into all documents within a collection of types
    • An index is split into shards
      • A primary shard and replicas
      • Write requests are routed to the primary shard, then replicated
      • Read requests are routed to any shard (primary or replica), as sketched below
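
A sketch of the document/index model over the REST API; the endpoint is a placeholder and the domain is assumed to allow unsigned access (otherwise requests must be signed, see ES Security):

  import requests

  ES = 'https://search-example-domain.us-east-1.es.amazonaws.com'  # placeholder

  # Index a JSON document; the write goes to the primary shard first,
  # then gets replicated.
  requests.put(f'{ES}/articles/_doc/1', json={'title': 'hello', 'views': 42})

  # Search the index; the read can be served by any shard.
  r = requests.get(f'{ES}/articles/_search',
                   json={'query': {'match': {'title': 'hello'}}})
  print(r.json())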

ES Features

  • Fully managed but not serverless; ES runs on EC2 instances
    • Can scale up or down without downtime, but scaling is manual
  • Network isolation (VPC)
    • Can use all the features of VPC
  • AWS integration
    • AWS IoT
    • S3 via Lambda to Kinesis
    • Kinesis Data Streams
    • DynamoDB Streams

ES Options

  • Dedicated master node(s)
    • Used only for cluster management; does not hold or process data
    • Choose the number and instance type of master nodes
  • Domains
    • A collection of all the resources of an ES cluster
  • Automatic snapshots to S3
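
A sketch of creating a domain with dedicated master nodes via boto3; the domain name, version, and instance types are examples:

  import boto3

  es = boto3.client('es')

  # Dedicated masters manage the cluster only; data lives on the 4 data nodes.
  es.create_elasticsearch_domain(
      DomainName='example-domain',
      ElasticsearchVersion='7.10',
      ElasticsearchClusterConfig={
          'InstanceType': 'r5.large.elasticsearch',
          'InstanceCount': 4,
          'DedicatedMasterEnabled': True,
          'DedicatedMasterType': 'm5.large.elasticsearch',
          'DedicatedMasterCount': 3,
      },
  )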

ES Security

  • Resource, identity, IP-based policies
  • Request signing
  • VPC
  • Cognito
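
A sketch of SigV4 request signing with botocore, so an IAM-based access policy accepts the request; the endpoint and region are placeholders:

  import requests
  from botocore.auth import SigV4Auth
  from botocore.awsrequest import AWSRequest
  from botocore.session import Session

  url = 'https://search-example-domain.us-east-1.es.amazonaws.com/_search'  # placeholder

  # Sign the request with the caller's IAM credentials (service name "es").
  req = AWSRequest(method='GET', url=url)
  SigV4Auth(Session().get_credentials(), 'es', 'us-east-1').add_auth(req)

  print(requests.get(url, headers=dict(req.headers)).status_code)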

ES Anti-patterns

  • OLTP
    • RDS or DynamoDB
  • Ad-hoc data querying
    • Athena

Amazon Athena

  • Interactive SQL query service for data in S3
  • Uses Presto under the hood
  • Supported data formats
    • CSV
    • JSON
    • ORC (columnar, splittable)
    • Parquet (columnar, splittable)
    • Avro (splittable)
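
A sketch of running a query with boto3; database, table, and output location are placeholders:

  import boto3

  athena = boto3.client('athena')

  # Results are written to the S3 staging location as CSV.
  resp = athena.start_query_execution(
      QueryString='SELECT status, COUNT(*) FROM access_logs GROUP BY status',
      QueryExecutionContext={'Database': 'example_db'},
      ResultConfiguration={'OutputLocation': 's3://example-bucket/athena-results/'},
  )
  print(resp['QueryExecutionId'])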

Athena Integration

  • Jupyter, Zeppelin, RStudio notebooks
  • QuickSight
  • Other visualization tools via ODBC / JDBC

Athena with Glue

  • Use the Glue Data Catalog to define a schema for unstructured data in S3
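
A sketch of pointing a Glue crawler at an S3 prefix so the inferred schema becomes queryable from Athena; the role, database, and path are placeholders:

  import boto3

  glue = boto3.client('glue')

  # The crawler infers a schema and registers it in the Glue Data Catalog.
  glue.create_crawler(
      Name='example-crawler',
      Role='arn:aws:iam::123456789012:role/example-glue-role',
      DatabaseName='example_db',
      Targets={'S3Targets': [{'Path': 's3://example-bucket/raw/'}]},
  )
  glue.start_crawler(Name='example-crawler')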

Athena Cost Model

  • Charged by the amount of data scanned per query; successful and cancelled queries count, failed queries do not
  • No charge for DDL
  • Save money by using columnar formats
    • ORC, Parquet
    • Also improves query performance
  • Glue and S3 are charged separately

Athena Security

  • Access control
    • IAM, ACLs, S3 bucket policies
    • Cross-account access is possible via an S3 bucket policy, as sketched below
  • Encrypt results at rest in S3 staging directory
  • Encrypted in transit with TLS
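
A sketch of a cross-account S3 bucket policy; the account ID and bucket name are placeholders:

  import boto3, json

  # Grant another account read access so its Athena queries can scan the bucket.
  policy = {
      'Version': '2012-10-17',
      'Statement': [{
          'Effect': 'Allow',
          'Principal': {'AWS': 'arn:aws:iam::999999999999:root'},
          'Action': ['s3:GetObject', 's3:ListBucket'],
          'Resource': ['arn:aws:s3:::example-bucket',
                       'arn:aws:s3:::example-bucket/*'],
      }],
  }
  boto3.client('s3').put_bucket_policy(Bucket='example-bucket',
                                       Policy=json.dumps(policy))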

Athena Anti-patterns

  • Highly formatted reports / visualization
    • Use QuickSight
  • ETL
    • Use Glue

Amazon Redshift

  • Fully-managed, petabyte scale data warehouse service
  • Designed for OLAP, not OLTP
  • SQL, ODBC, JDBC interfaces just like other relational databases
  • Easily scale up and down manually
  • Built-in replication & backups

Redshift Architecture

  • Redshift Cluster
    • A leader node
      • Manage communication with clients & compute nodes
      • Receives queries from clients & develops execution plans
      • Coordinates the parallel execution of those plans with compute nodes
      • Aggregates the intermediate results from compute nodes
    • 1~128 compute nodes
      • Store user data
      • Execute the steps in the execution plans
      • Can transmit data among themselves
      • Node types
        • Dense storage (DS): xl or 8xl
          • HDD
          • Low costs
        • Dense compute (DC): xl or 8xl
          • SSD
          • Larger memory
          • Faster CPU
      • Slices
        • Each compute node is divided into slices
        • A slice allocates a portion of the memory and disk storage of the node
        • The number of slices per node is determined by the node size

Redshift Spectrum

  • Query data in S3 as if it were a Redshift table, without loading it
  • Limitless concurrency & horizontal scaling
  • Supports a wide variety of data formats
  • Supports Gzip and Snappy compression
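
A sketch of querying S3 through Spectrum via psycopg2; the connection details, Glue database, and role ARN are placeholders:

  import psycopg2

  conn = psycopg2.connect(
      host='example.abc123.us-east-1.redshift.amazonaws.com',
      port=5439, dbname='dev', user='admin', password='example-password')
  cur = conn.cursor()

  # Register an external schema backed by the Glue Data Catalog,
  # then query the S3 data in place, without loading it.
  cur.execute("""
      CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
      FROM DATA CATALOG DATABASE 'example_db'
      IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
  """)
  cur.execute('SELECT COUNT(*) FROM spectrum.events')
  print(cur.fetchone())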

Redshift Performance

  • Massively Parallel Processing (MPP)
  • Columnar Data Storage
  • Column Compression

Redshift Durability

  • Redshift has 3 copies of data
    • An original copy within the cluster
    • A backup replica copy within the cluster
    • Continuously backed up to S3
      • Can additionally be replicated asynchronously to another region
        • Default retention period is 1 day, up to 35 days; set 0 to turn backups off
  • Redshift mirrors each drive's data to other nodes within the cluster if there are 2 or more compute nodes
    • Can detect a failed drive or node and replace it automatically
    • Drive failure
      • Redshift remains available, but performance may degrade
    • Node failure
      • Redshift is unavailable during recovery
      • The most frequently accessed data is restored from S3 first so you can resume querying as quickly as possible
  • Redshift is limited to a single AZ
    • You have to restore the cluster from S3 into a different AZ if the AZ fails

Redshift Scaling

  • Vertical and horizontal scaling on demand
  • During scaling
    • A new cluster is created while the old cluster remains available for reads
    • The CNAME is flipped to the new cluster, with a few minutes of downtime
    • Data moved in parallel to new compute nodes

Redshift Distribution Styles

  • AUTO
    • Default style; Redshift chooses EVEN, KEY, or ALL depending on the size of the data
  • EVEN
    • Rows distributed across slices in round-robin
  • KEY
    • Rows distributed based on a column hash
  • ALL
    • Entire table is copied to every node
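
DDL for the three explicit styles, shown as SQL embedded in Python; the table and column names are examples:

  # Run through any SQL client connected to the cluster.
  DISTSTYLE_DDL = """
  CREATE TABLE events (user_id INT, event_type VARCHAR(32))
    DISTSTYLE KEY DISTKEY(user_id);   -- rows hashed on user_id

  CREATE TABLE raw_log (line VARCHAR(MAX))
    DISTSTYLE EVEN;                   -- round-robin across slices

  CREATE TABLE dim_country (code CHAR(2), name VARCHAR(64))
    DISTSTYLE ALL;                    -- full copy on every node
  """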

Redshift Sort Keys

  • Rows are stored on disk in sorted order based on the column designated as a sort key
    • Like an index
    • Makes for fast range queries
  • Single vs. Compound vs. Interleaved sort keys
    • Single
    • Compound (default): sorts by all the listed columns, in the order listed
      • Performance decreases when queries filter only on a secondary sort key without referencing the primary one
      • The column order matters
      • Improves compression
    • Interleaved: gives equal weight to each column or subset of columns in the sort key
      • Useful when different queries filter on different columns
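
The two multi-column variants as DDL, shown as SQL embedded in Python; the schema is an example:

  SORTKEY_DDL = """
  CREATE TABLE clicks_compound (ts TIMESTAMP, user_id INT, url VARCHAR(256))
    COMPOUND SORTKEY(ts, user_id);     -- fastest when filters lead with ts

  CREATE TABLE clicks_interleaved (ts TIMESTAMP, user_id INT, url VARCHAR(256))
    INTERLEAVED SORTKEY(ts, user_id);  -- equal weight to ts and user_id
  """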

Redshift Import / Export Data

  • COPY command
    • Parallel
    • From S3, DynamoDB, EMR / EC2 / other remote hosts via SSH
    • Copy from S3
      • Use S3 object prefix or path
      • Manifest file
    • Authorization
      • IAM role based
      • Key based
  • UNLOAD command
    • Efficient way to unload from a table into files in S3
  • Enhanced VPC routing
    • Forces all COPY/UNLOAD traffic through the VPC rather than the public Internet
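
COPY and UNLOAD with IAM-role authorization, as SQL embedded in Python; the bucket, table, and role ARN are placeholders:

  # COPY loads files in parallel from the S3 prefix into the table.
  COPY_SQL = """
  COPY events
  FROM 's3://example-bucket/events/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
  FORMAT AS CSV;
  """

  # UNLOAD writes query results back to S3, in parallel, one file per slice.
  UNLOAD_SQL = """
  UNLOAD ('SELECT * FROM events')
  TO 's3://example-bucket/exports/events_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
  GZIP;
  """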

Redshift Copy Grant for Cross-region Snapshot Copy

  • In the destination AWS region
    • Create a KMS key or use an existing one
    • Create a snapshot copy grant with a unique name, specifying that KMS key ID
  • In the source AWS region
    • Enable cross-region snapshot copy, specifying the copy grant
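
A sketch of both steps with boto3; the regions, names, and key ARN are placeholders:

  import boto3

  # Destination region: create the copy grant around a KMS key.
  dst = boto3.client('redshift', region_name='eu-west-1')
  dst.create_snapshot_copy_grant(
      SnapshotCopyGrantName='example-grant',
      KmsKeyId='arn:aws:kms:eu-west-1:123456789012:key/example-key-id',
  )

  # Source region: enable cross-region snapshot copy using that grant.
  src = boto3.client('redshift', region_name='us-east-1')
  src.enable_snapshot_copy(
      ClusterIdentifier='example-cluster',
      DestinationRegion='eu-west-1',
      RetentionPeriod=7,
      SnapshotCopyGrantName='example-grant',
  )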

Redshift and PostgreSQL

  • Can connect Redshift to PostgreSQL
  • Used to copy and sync data between PostgreSQL and Redshift

Redshift Integration

  • S3
    • COPY / UNLOAD command
  • DynamoDB
    • COPY command
  • EMR / EC2
    • COPY command via SSH
  • Data Pipeline
  • Database Migration Service (DMS)

Redshift Workload Management (WLM)

  • Manage query priorities using query queues
  • Prevents short, fast queries from getting stuck behind long, slow ones
  • Configured via the console, CLI, or API
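
A sketch of defining two queues through a parameter group with boto3; the group name and queue layout are examples:

  import boto3, json

  # One slot for long ETL queries, five for short dashboard queries.
  wlm = [
      {'query_group': ['etl'], 'query_concurrency': 1},
      {'query_group': ['dashboard'], 'query_concurrency': 5},
  ]
  boto3.client('redshift').modify_cluster_parameter_group(
      ParameterGroupName='example-parameter-group',
      Parameters=[{
          'ParameterName': 'wlm_json_configuration',
          'ParameterValue': json.dumps(wlm),
      }],
  )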

Redshift VACUUM Command

  • Cleans up tables: reclaims space and re-sorts rows
  • VACUUM FULL (default)
    • Resort rows and reclaim space from deleted rows
  • VACUUM DELETE ONLY
  • VACUUM SORT ONLY
  • VACUUM REINDEX
    • Re-analyzes interleaved sort keys, then performs a full VACUUM
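
The variants as SQL embedded in Python; the table name is an example:

  VACUUM_SQL = """
  VACUUM FULL events;         -- re-sort rows and reclaim deleted space
  VACUUM DELETE ONLY events;  -- reclaim space without re-sorting
  VACUUM SORT ONLY events;    -- re-sort without reclaiming space
  VACUUM REINDEX events;      -- re-analyze interleaved sort keys, then full vacuum
  """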

Redshift Security

  • Database can be encrypted using KMS or HSM

Redshift Anti-patterns

  • Small data sets
    • Use RDS
  • OLTP
    • Use RDS or DynamoDB
  • Unstructured data
    • ETL first with EMR or Glue
    • Or use Redshift Spectrum
  • BLOB data
    • Use S3

Amazon RDS

  • Hosted relational database service
  • Not for big data

ACID

  • RDS offers full ACID compliance
    • Atomicity
    • Consistency
    • Isolation
    • Durability

Amazon Aurora

  • Up to 64TB per database instance
  • Up to 15 read replicas
  • Continuous backup to S3
  • Can auto-scale with Aurora Serverless

Aurora Security

  • In VPC
  • At-rest with KMS
  • In-transit with SSL